{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perceptron\n",
    "=======\n",
    "\n",
    "The perceptron algorithm was invented in 1957 by Frank Rosenblatt.\n",
    "\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/perceptron_nyt.png\" width=\"650\">\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The learning rule\n",
    "--------------------\n",
    "\n",
    "The learning rule of the Perceptron is based on [Hebbian learning](https://en.wikipedia.org/wiki/Hebbian_theory). Rosenblatt was inspired by Donald Hebb and his 1949 book called *\"The Organization of Behavior\"*. The core idea of the book is the following:\n",
    "\n",
    "*When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.*\n",
    "\n",
    "This simple but poweful rule has been called the **Hebb's rule** and it is the learning principle used in the original Perceptron. The three principal rule behind the learning phase can be described as follows:\n",
    "\n",
    "1. If the output is correct, leave its weights **unchanged**\n",
    "2. If the output incorrectly outputs zero, **add** the input vector to the weight vector\n",
    "3. If the output incorrectly outputs a one, **subtract** the input vector from the weight vector\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Geometric interpretation\n",
    "----------------------------\n",
    "\n",
    "The geometric interpretation of the learning rule can be understood thinking in terms of vectors and planes. We start from the **weight space** that is the space containing all the possible configurations of weights in the Perceptron. If the network only has two weights then the weight space is a Cartesian space. If the network has 14 weights then the weight space is a 14-dimensional hyperspace. To simplify we do not take into account the bias. In this simplified version each training case can be represented as a hyperplane passing through the origin. This hyperplane separate the configuration of weights that leads to a correct output from the configuration that leads to a wrong output. A point in the space represents the particular setting of all the weights. A vector starting from the origin and arriving to the point identifies a peculiar setting. In the following image (taken from Hinton course on Coursera) it is possible to see this geometric representation:\n",
    "\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/perceptron_geometry.png\" width=\"300\">\n",
    "</p>\n",
    "\n",
    "\n",
    "The black line is the hyperplane that separate the weight space in two parts. The **blue vector** is an input vector passed to the perceptron. The correct answer associated to the blue vector is zero, meaning that it is orthogonal to the hyperplane. A specific configuration of weights is represented with a **red vector**. The red vector represents a wrong configuration of weights because the scalar (dot) product between it and the blue vector has a wrong sign. At the same time the **green vector** represents the correct configuration of weights because the dot product with the blue vector has the correct sign.\n",
    "\n",
    "Here we recall what is a dot product between two vectors $\\boldsymbol{a}$ and $\\boldsymbol{b}$:\n",
    "\n",
    "$$ \\boldsymbol{a} \\cdot \\boldsymbol{b} = |\\boldsymbol{a}| \\ |\\boldsymbol{b}| \\ cos \\ \\theta$$\n",
    "\n",
    "The $cos$ operator decides the sign of the resulting scalar. When $0 < \\theta < \\frac{\\pi}{2}$ the sign is positive, when $\\theta = \\frac{\\pi}{2}$ the result is zero, and when $\\theta > \\frac{\\pi}{2}$ the sign is negative. The same rule applies for angles $> \\pi$. The red vector represent a bad configuration of weights because the dot product with the blue vector leads to a positive scalar, that once passed through the sign activation function gives a values of one. On the other hand the green vector returns a negative scalar that lead to the correct output of zero once passed through the activation function.\n",
    "\n",
    "The **goal** of the learning is to find a good configuration of weights that satisfies all the learning cases. For instance if another learning case is to give a correct ansewer of 1 then we need two hyperplanes and the good configuration for the weights lay between them. This example is represented in the following image (taken from Hinton's Coursera slides):\n",
    "\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/perceptron_geometry_2.png\" width=\"300\">\n",
    "</p>\n",
    "\n",
    "An important point here is that the optimal configuration of weights **may not exist**! However, if the vector exists, then it lies in a hypercone with its apex in the origin, meaning that all the vectors inside that cone are good solutions to the problem.\n",
    "\n",
    "At **training time** the weight vector is moved closer or farther from the input vector (blue vector). If we consider again the three rules discussed in the previous section, we can get the geometric interpretation fo them. Here the rules are showed again:\n",
    "\n",
    "1. If the output is correct, leave its weights unchanged\n",
    "2. If the output incorrectly outputs zero, add the input vector to the weight vector\n",
    "3. If the output incorrectly outputs a one, subtract the input vector from the weight vector\n",
    "\n",
    "\n",
    "applying the three rules to the weight vector we get as result that the weight vector is moved (rule 2 and 3). If there is an hypercone where all the training criteria are satisfied, then the weight vector will reach that cone and will stay there (repeating rule 1).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Input manipulation\n",
    "----------------------\n",
    "\n",
    "The input to the perceptron can be manipulated in different ways in order to simplify the gradient descent process.\n",
    "For a linear unit like the perceptron, the error surface is always a quadratic bowl. This bowl can assume different shapes depending on the type of input we are passing. Here I discuss two methods: shifting and scaling. In both cases we suppose that the perceptron has only two inputs and one output, in this way we can represent the error surface in a three-dimensional space. To understand the advantage of the two methods the reader must remember that the gradient is a vector perpendicular to the error surface. Here I will graphically represent the error surface with a cross-section (black ellipse/circle), the weight axes with two lines (blue and gree), and the gradient with a red arrow.\n",
    "\n",
    "**Shifting the input**: let's suppose the dataset has these two samples: $x_{1}=\\{101, 101 \\}$ and $x_{2}=\\{101, 99\\}$, with targets: $t_{1}=\\{2\\}$ and $t_{2}=\\{0\\}$. The shift consist in transforming each component of the input vector so that the dataset has *zero mean*. In our case the shifting will produce two new training samples: $\\hat{x}_{1} = \\{ 1, 1 \\}$ and $\\hat{x}_{2} = \\{ 1, -1 \\}$ with same targets. The geometric interpretation of this operation is that at the beginning we had an elongated bowl for the error surface (with elliptical cross-section), while after the shifting the two dimensions are equal (circular cross-section). This is great because now the gradient point toward the center, meaning that it will converge much faster to the minima than in the previous case. A graphical representation with a cross-section of the error surface for the two conditions is given below:\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/perceptron_shifting.png\" width=\"500\">\n",
    "</p>\n",
    "\n",
    "**Scaling the input**: now we suppose that the dataset for the two inputs is the following: $x_{1}=\\{0.1, 10 \\}$ and $x_{2}=\\{0.1, -10\\}$, with targets: $t_{1}=\\{2\\}$ and $t_{2}=\\{0\\}$. The scaling consists in manipulating the inputs to have *unit variance* on the entire dataset. Applying the scaling the new inputs are the following: $\\hat{x}_{1} = \\{ 1, 1 \\}$ and $\\hat{x}_{2} = \\{ 1, -1 \\}$. The geometrical interpretation is similar to the previous operation, the shape of the error surface get scaled to a circular bowl where the gradient descent is much faster.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/perceptron_scaling.png\" width=\"500\">\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
